CPSC 545/445 (Autumn 2003) - Class 5: Pairwise Sequence Alignment Module 2, Part 1 Module 2: Sequence Alignment Note: Biological sequences evolve (dna: single base substitution, deletion, insertion, ...) Problem: Test two sequences (dna, protein, rna) for similarities => pairwise sequence alignment (psa) Applications: - phylogenetic tree reconstruction - find functional similarities (similar sequences => similar structure => similar function) Need: - definition of type of alignment (dna, rna, protein; two or more sequences) - score for evaluating alignments --- 2.1 Global Pairwise Sequence Alignment Problem: Given: x = x[1] x[2] ... x[n] and y = y[1] y[2] ... y[m] Find an optimal alignment of x and y, i.e. an alignment with maximal score simple case: n=m, assume only base substitutions (no gaps) score = sum of scores s(x[i],y[i]) - these are defined by scoring matrix (e.g., BLOSUM or PAM matrix) no maximisation to be done! more realistic: allow gaps (to model effects of deletions and insertions) problem: where did deletions and insertions occur most likely, i.e., where to place the gaps? definition: a pairwise Sequence alignment of x,y is pair of sequences x',y' possibly containing gaps "-" with: - x' with gaps remove = x, y' with gaps removed = y - length(x') = length(y') - x'[i] = y'[i] = gap never happens the score of an alignment x',y' is defined as the sum of scores over all non-gap aligned pairs (x'[i],y'[i]) + scores for regions containing gaps (additive score) --- gap scoring: gaps are penalised; gamma(g) = score (penalty) for gap of length g linear gap score: for gap of length g, gamma(g)=-gd (d = gap penalty constant, typically = 8 in practice for protein psa) affine gap score: for gap of length g, gamma(g)=-d-(g-1)e (d = gap-open penalty, e = gap-extension penalty, d > e; typically d=12, e=2 in practice for protein psa) motivation for affine gap scores: - gaps are often longer than one residue - mutation can insert more than one residues at once - matching cDNA to genomic DNA (introns/exons) general gap score: arbitrary function gamma(g) (scoring matrices and gap penalties can be derived from and related to probabilistic models of sequence evolution; details later) --- 2.2 finding optimal psa (using linear gap scores) brute-force-approach: - enumerate & score all alignments problem: (2n choose n) ~= 2^2n / sqrt(2 pi n) many! (exponential in n) => infeasible for realistic sequence lengths much better: dynamic programming approach (Needleman-Wunsch algorithm): idea: build up optimal alignments based on optimal alignments for smaller subsequences recursively build 2-dimensional matrix F, where F(i,j) = score of best alignment between x[1..i] and y[1..j] F(0,0)=0 F(i,0)=-id % x aligned to all gaps F(0,j)=-jd % y aligned to all gaps F(i,j) = max { F(i-1,j-1) + s(x[i], y[i]), % align x[i] with y[i] F(i-1,j) - d, % align x[i] with gap F(i,j-1) - d } % align y[i] with gap fill matrix from top left to bottom right, store pointers according to the term in the max above that achieved the maximum => F(m,n) = score of optimal alignment the alignment itself is constructed by tracing back the pointers from cell F(m,n) to F(0,0) (this algorithm works because score = sum of independent pieces) complexity: O(m*n) time and memory (memory complexity can be reduced to O(m+n) with only minor increase in computation time using dynamic programming + divide and conquer approach; see [BSA], Ch.2.6) --- 2.3 local pairwise sequence alignment Often, one wants to align subsequences of two larger sequences. Examples: - locating common domains in proteins (e.g., transmembrane proteins) - comparing extended sections of genomic DNA sequence - detecting similarity between highly diverged sequences (only certain parts of sequence conserved enough for alignment, rest may have accumulated too much noise) local alignments are generally not subsets of global alignments! naïve approach: - pick each possible pair of subsequences from x,y. use global alignment algorithm to align those. -> O(m^2 * n^2) possible subsequences of length O(m), O(n), each global alignment takes time O(m*n) -- dynamic programming approach (smith-waterman algorithm): = modification of needlemen-wunsch algorithm, complexity: O(m*n) time & space (again, linear gap scores) same iterative construction process for dynamic programming matrix F(i,j) as before, but: F(i,j) := max { F(i-1,j-1)+s(x[i],y[j]) F(i-1,j) - d F(i,j-1) - d 0 % try new subsequence alignment } The forth case corresponds to starting a new subsequence alignment at position i in x and j in y, instead of extending the best alignment up to positions i,j so far. Modified initialisation: F(i,0) := max{0,-id} = 0 F(0,j) := max{0,-dj} = 0 (starting and ending gaps are no longer penalised, since we are aligning subsequences only) The score of optimal alignment is no longer necessarily given by F(m,n), but can occur at any matrix position. Likewise, tracing the pointers backwards will typically stop before reaching F(0,0). For local alignment to work, the expected score for a random match (of unrelated sequenc sections) must be negative. Otherwise, long matches between unrelated sequences would have high scores, just based on their length, leading to global or nearly global alignments. -> global vs. local alignment behaviour of smith-waterman highly dependent on scoring system used. For ungapped case, theoretical analysis of the requirements for the expected score of a random match to be negative is possible; it is based on an unrealistic assumption (all sequence positions evolve independently) and does not extend to the (more practically relevant) gapped case. Nevertheless, consequences of the theoretical model in terms of requirements of scoring systems seem to be practically relevant. In practice, balance between local/global behaviour of local psa is achieved through empirical methods (experimentation, measurements). --- Resources: BSA, Ch.2 [good introduction; contains further references]